Handwork

My policy on handwork is that it is required for Statistics PhD students and MS in Statistics students. It is encouraged (but not required) for MS in Applied Statistics and all other students. (If you are in the latter group, you will not be penalized for getting these wrong … )

Exercises from the book: 14.1, 14.3, 14.6, 15.2, 15.4.

14.1

p<-c(.001,.01,.05,.1,.3,.5,.7,.9,.95,.99,.999)
var<-p*(1-p)
rbind(p,var)
##         [,1]   [,2]   [,3] [,4] [,5] [,6] [,7] [,8]   [,9]  [,10]    [,11]
## p   0.001000 0.0100 0.0500 0.10 0.30 0.50 0.70 0.90 0.9500 0.9900 0.999000
## var 0.000999 0.0099 0.0475 0.09 0.21 0.25 0.21 0.09 0.0475 0.0099 0.000999

Answer: The variance \(\pi(1-\pi)\) peaks at 0.25 when \(\pi=0.5\) and shrinks toward 0 as \(\pi\) approaches 0 or 1, so when \(\pi\) is close to 0 or 1 the heteroscedasticity problem is serious.
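
As a quick, self-contained numeric check of how unequal these variances are:

```r
p <- c(.001, .01, .05, .1, .3, .5, .7, .9, .95, .99, .999)
v <- p * (1 - p)
max(v) / min(v)  # the largest variance is roughly 250 times the smallest
```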

14.3, 14.6, 15.2, 15.4

You can type your answers and submit within this file or you can do this work on paper and submit a scanned copy (be sure it is legible).

Data analysis

knitr::opts_chunk$set(warning=FALSE,message=FALSE,error=FALSE)

1. Exercise D14.1 (Dichotomous)

For this question, we will use the Chile.txt dataset, which has a polytomous outcome: voting intention (yes, no, abstain, undecided). For this problem, focus only on the subset of the data with outcomes of either ‘yes’ or ‘no’.

  1. Formulate a model that makes substantive sense in the context of the data set - for example, constructing dummy regressors to represent factors and including interaction regressors where these are appropriate - and fit a linear logistic regression of the response variable on the explanatory variables, reporting the estimated regression coefficients and their asymptotic standard errors.

Answer:

From the voting background, I think region, age, sex, education, income and statusquo could all have interaction effects with each other; e.g., for people with different education levels, income has different effects on their voting choices. But in my view, the effect of population is unlikely to be influenced so strongly by the other factors.

Thus, I include ‘population’ as a main effect only, while the other predictors enter both as main effects and in pairwise interactions.

library(dplyr)
df0<-read.table("./data/Chile.txt", stringsAsFactors=TRUE)
df1<-df0 %>%
  filter(vote=="Y" | vote=="N") %>%
  na.omit()

fit.0<-glm(vote~(region+sex+age+education+income+statusquo)^2+population, family = binomial, data=df1)
#fit.0<-glm(vote~region+population+sex+age+education+income+statusquo+sex*income+age*income+sex*statusquo+age*statusquo, family = binomial, data=df1)
summary(fit.0)
## 
## Call:
## glm(formula = vote ~ (region + sex + age + education + income + 
##     statusquo)^2 + population, family = binomial, data = df1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2801  -0.2580  -0.0475   0.1859   3.0063  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            2.655e+00  1.305e+00   2.034  0.04191 *  
## regionM               -8.863e-01  2.181e+00  -0.406  0.68443    
## regionN               -9.147e-01  1.471e+00  -0.622  0.53418    
## regionS               -1.922e+00  1.309e+00  -1.468  0.14216    
## regionSA              -3.228e+00  1.398e+00  -2.309  0.02094 *  
## sexM                  -3.019e+00  9.715e-01  -3.107  0.00189 ** 
## age                   -1.772e-02  2.402e-02  -0.738  0.46054    
## educationPS           -1.049e+00  1.653e+00  -0.635  0.52545    
## educationS            -1.264e-01  9.940e-01  -0.127  0.89880    
## income                 2.446e-05  2.037e-05   1.201  0.22988    
## statusquo              3.732e+00  7.427e-01   5.025 5.03e-07 ***
## population             9.044e-07  1.503e-06   0.602  0.54738    
## regionM:sexM           1.187e+00  1.287e+00   0.922  0.35657    
## regionN:sexM           1.085e+00  7.761e-01   1.398  0.16197    
## regionS:sexM           1.405e+00  7.030e-01   1.998  0.04569 *  
## regionSA:sexM          1.361e+00  6.925e-01   1.965  0.04938 *  
## regionM:age            4.781e-02  4.612e-02   1.036  0.29997    
## regionN:age           -3.423e-03  2.810e-02  -0.122  0.90305    
## regionS:age            8.398e-03  2.515e-02   0.334  0.73842    
## regionSA:age           4.272e-02  2.490e-02   1.716  0.08624 .  
## regionM:educationPS    3.161e+00  3.747e+00   0.844  0.39890    
## regionN:educationPS    8.556e-01  1.392e+00   0.615  0.53878    
## regionS:educationPS    5.503e-01  1.257e+00   0.438  0.66153    
## regionSA:educationPS   8.365e-01  1.256e+00   0.666  0.50526    
## regionM:educationS    -3.134e-01  1.473e+00  -0.213  0.83151    
## regionN:educationS    -7.140e-02  8.948e-01  -0.080  0.93640    
## regionS:educationS    -6.768e-03  7.907e-01  -0.009  0.99317    
## regionSA:educationS    4.612e-01  8.107e-01   0.569  0.56941    
## regionM:income        -5.606e-05  3.973e-05  -1.411  0.15820    
## regionN:income        -6.343e-06  1.502e-05  -0.422  0.67283    
## regionS:income         4.511e-06  1.263e-05   0.357  0.72095    
## regionSA:income       -2.528e-06  1.096e-05  -0.231  0.81762    
## regionM:statusquo     -1.259e+00  1.070e+00  -1.176  0.23951    
## regionN:statusquo     -1.592e+00  6.117e-01  -2.602  0.00925 ** 
## regionS:statusquo     -1.088e+00  5.794e-01  -1.878  0.06043 .  
## regionSA:statusquo    -9.768e-01  5.891e-01  -1.658  0.09729 .  
## sexM:age               2.126e-02  1.606e-02   1.324  0.18559    
## sexM:educationPS       1.356e+00  8.178e-01   1.658  0.09730 .  
## sexM:educationS        1.435e+00  5.162e-01   2.781  0.00542 ** 
## sexM:income           -9.040e-06  8.015e-06  -1.128  0.25934    
## sexM:statusquo        -2.543e-01  3.185e-01  -0.798  0.42466    
## age:educationPS       -3.495e-02  3.596e-02  -0.972  0.33107    
## age:educationS        -1.894e-02  1.812e-02  -1.045  0.29597    
## age:income             2.694e-08  2.819e-07   0.096  0.92388    
## age:statusquo          1.062e-02  1.120e-02   0.949  0.34285    
## educationPS:income    -1.699e-05  1.349e-05  -1.260  0.20776    
## educationS:income     -3.172e-05  1.343e-05  -2.362  0.01817 *  
## educationPS:statusquo  6.569e-01  6.148e-01   1.069  0.28525    
## educationS:statusquo   1.381e-01  3.675e-01   0.376  0.70715    
## income:statusquo       7.877e-06  6.418e-06   1.227  0.21970    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2360.29  on 1702  degrees of freedom
## Residual deviance:  654.56  on 1653  degrees of freedom
## AIC: 754.56
## 
## Number of Fisher Scoring iterations: 7

  2. Construct an analysis-of-deviance table for the model fit in part 1.
library("car")
Anova(fit.0, test="LR")
## Analysis of Deviance Table (Type II tests)
## 
## Response: vote
##                     LR Chisq Df Pr(>Chisq)    
## region                  3.81  4   0.432979    
## sex                     6.98  1   0.008237 ** 
## age                     0.00  1   0.948348    
## education              10.67  2   0.004830 ** 
## income                  0.96  1   0.327144    
## statusquo            1454.36  1  < 2.2e-16 ***
## population              0.36  1   0.546664    
## region:sex              4.96  4   0.290945    
## region:age              5.73  4   0.220422    
## region:education        2.03  8   0.980225    
## region:income           3.05  4   0.548669    
## region:statusquo        7.90  4   0.095323 .  
## sex:age                 1.75  1   0.186086    
## sex:education           8.18  2   0.016768 *  
## sex:income              1.32  1   0.251469    
## sex:statusquo           0.64  1   0.422693    
## age:education           1.57  2   0.456503    
## age:income              0.01  1   0.923840    
## age:statusquo           0.92  1   0.336242    
## education:income        7.36  2   0.025199 *  
## education:statusquo     1.26  2   0.532067    
## income:statusquo        1.76  1   0.184373    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  3. Fit a final model to the data that includes the statistically significant effects. Construct an effect display for each high-order term in the model. If the model is additive, (i) suggest two interpretations of each estimated coefficient; and (ii) construct likelihood-ratio-based 95-percent confidence intervals for the regression coefficients, comparing these with confidence intervals based on the Wald statistic.

Answer:

I keep the predictors that are statistically significant in both the Wald tests and the LR tests.

fit.1<-glm(vote~sex+education+statusquo+region*statusquo+sex*education+education*income, family = binomial, data=df1)
library(effects)
plot(predictorEffects(fit.1, ~income), lines=list(multiline=TRUE)) # education*income

plot(predictorEffects(fit.1, ~statusquo), lines=list(multiline=TRUE)) # region*statusquo

plot(predictorEffects(fit.1, ~sex), lines=list(multiline=TRUE)) # sex*education

  (i) Suggest two interpretations of each estimated coefficient
  • From odds level
summary(fit.1)
## 
## Call:
## glm(formula = vote ~ sex + education + statusquo + region * statusquo + 
##     sex * education + education * income, family = binomial, 
##     data = df1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2366  -0.2526  -0.1007   0.2043   3.0222  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.413e+00  4.047e-01   3.492 0.000479 ***
## sexM               -1.209e+00  3.294e-01  -3.671 0.000242 ***
## educationPS        -1.273e+00  5.782e-01  -2.202 0.027638 *  
## educationS         -5.321e-01  3.994e-01  -1.332 0.182799    
## statusquo           4.071e+00  4.483e-01   9.081  < 2e-16 ***
## regionM            -8.195e-03  5.667e-01  -0.014 0.988463    
## regionN            -3.878e-01  3.892e-01  -0.996 0.319155    
## regionS            -5.867e-01  3.465e-01  -1.693 0.090421 .  
## regionSA           -3.387e-01  3.455e-01  -0.980 0.326952    
## income              1.835e-05  1.000e-05   1.835 0.066541 .  
## statusquo:regionM  -1.860e+00  7.061e-01  -2.635 0.008422 ** 
## statusquo:regionN  -1.334e+00  5.503e-01  -2.425 0.015305 *  
## statusquo:regionS  -9.096e-01  5.185e-01  -1.754 0.079367 .  
## statusquo:regionSA -5.848e-01  5.179e-01  -1.129 0.258835    
## sexM:educationPS    7.822e-01  6.095e-01   1.283 0.199363    
## sexM:educationS     1.109e+00  4.576e-01   2.423 0.015372 *  
## educationPS:income -1.786e-05  1.092e-05  -1.636 0.101853    
## educationS:income  -3.038e-05  1.106e-05  -2.747 0.006013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2360.29  on 1702  degrees of freedom
## Residual deviance:  679.84  on 1685  degrees of freedom
## AIC: 715.84
## 
## Number of Fisher Scoring iterations: 7

Intercept: for a female living in central Chile, with no income, primary education, and a status-quo support score of 0, the estimated odds of voting Yes are \(e^{1.413}\).

sexM (including interaction terms): holding the other predictors fixed, a male’s odds of voting Yes are \(e^{-1.209+0.7822\,educationPS_i+1.109\,educationS_i}\) times a female’s odds; i.e., the gender difference in voting odds varies across education levels. This is also shown in the sex effect plot: among people with primary education, males are clearly less likely (the odds are \(e^{-1.209}\) times as large) to vote Yes than females, while in the secondary-education group this difference is very small.

educationPS (educationS is analogous): holding the other predictors fixed, on average, the odds of voting Yes for people with post-secondary education are \(e^{-1.273+0.7822\,sexM_i-1.786\times 10^{-5}\,income_i}\) times the odds for people with primary education. Thus, the effect of education level on voting is related to both gender and income.

statusquo: holding the other predictors fixed, on average, a one-unit increase in the scale of support for the status quo (in fact, I am not sure what exactly one unit means for this predictor…) multiplies the odds of voting Yes by \(e^{4.071-1.860\,regionM_i-1.334\,regionN_i-0.9096\,regionS_i-0.5848\,regionSA_i}\). That is, the effect of statusquo on voting differs across regions, but the overall association is positive: the greater the scale of support for the status quo, the more likely people are to vote Yes.

regionM (the other region dummy variables are analogous): holding the other predictors fixed, on average, the odds of voting Yes for people living in the Metropolitan Santiago area are \(e^{-0.008195-1.860\,statusquo_i}\) times the odds for people living in central Chile. Thus, the region effect depends on the scale of support for the status quo: the greater that scale, the stronger the negative effect; i.e., compared with people in central Chile, people in the other regions are less likely to vote Yes, and increasingly so as support for the status quo grows. This conclusion also holds for the other regions.

income: holding the other predictors fixed, on average, a one-peso increase in monthly income multiplies the odds of voting Yes by \(e^{(1.835-1.786\,educationPS_i-3.038\,educationS_i)\times 10^{-5}}\). That is, the effect of income on voting differs across education levels, and this distinction is quite pronounced. For people with primary education, the higher the monthly income, the more likely they are to vote Yes; for people with secondary education, the effect goes in the opposite direction; and for people with post-secondary education, income does not appear to affect the voting preference.
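
These combined coefficients can be pulled directly from the fitted model; a small sketch, assuming `fit.1` from the chunk above is still in the workspace and that the coefficient names match the summary:

```r
b <- coef(fit.1)
# odds ratio of voting Yes, male vs. female, within each education level:
exp(b["sexM"])                           # primary education
exp(b["sexM"] + b["sexM:educationS"])    # secondary
exp(b["sexM"] + b["sexM:educationPS"])   # post-secondary
```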

  • From the probability level (centered model)

Intercept: At the mean level of all the factors, \(logit^{-1}(1.413)\) is the estimated probability of voting yes.

sex, education, region, income: let \(\beta_k\) denote the coefficient of each of these predictors. Since the divide-by-4 rule applies to all of them, I interpret them together. With the other predictors at their mean levels (what it means for sex to be at its mean level is debatable…), a one-unit change in the covariate (one peso, one unit on the status-quo scale) corresponds to at most a \(\beta_k/4\) difference in the probability of voting Yes.

interaction: \(A \times B\) (A and B represent two covariates): if the coefficient of A:B has the same sign as the coefficient of A, the importance of A’s effect on the probability of voting Yes increases with larger B (if B is an indicator variable, then A’s effect is stronger for the B = 1 group than for the baseline group).
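
A quick sketch of the divide-by-4 rule applied to these estimates (assuming `fit.1` is available; the bound is approximate and is tightest near a predicted probability of 0.5):

```r
b <- coef(fit.1)
b["sexM"] / 4    # roughly -0.30: at most about a 30-percentage-point drop for males
b["income"] / 4  # per peso, a vanishingly small change in probability
```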

center_colmeans <- function(x) {
    xcenter = colMeans(x)
    x - rep(xcenter, rep.int(nrow(x), ncol(x)))
}
X <- model.matrix(terms(fit.1), data = model.frame(fit.1))

X.c<-center_colmeans(X)
# fit on the centered design matrix; centering shifts only the intercept,
# the slope estimates match fit.1
fit.1.c<-glm(df1$vote~X.c, family = binomial)
summary(fit.1.c)
## 
## Call:
## glm(formula = df1$vote ~ X, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2366  -0.2526  -0.1007   0.2043   3.0222  
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          1.413e+00  4.047e-01   3.492 0.000479 ***
## X(Intercept)                NA         NA      NA       NA    
## XsexM               -1.209e+00  3.294e-01  -3.671 0.000242 ***
## XeducationPS        -1.273e+00  5.782e-01  -2.202 0.027638 *  
## XeducationS         -5.321e-01  3.994e-01  -1.332 0.182799    
## Xstatusquo           4.071e+00  4.483e-01   9.081  < 2e-16 ***
## XregionM            -8.195e-03  5.667e-01  -0.014 0.988463    
## XregionN            -3.878e-01  3.892e-01  -0.996 0.319155    
## XregionS            -5.867e-01  3.465e-01  -1.693 0.090421 .  
## XregionSA           -3.387e-01  3.455e-01  -0.980 0.326952    
## Xincome              1.835e-05  1.000e-05   1.835 0.066541 .  
## Xstatusquo:regionM  -1.860e+00  7.061e-01  -2.635 0.008422 ** 
## Xstatusquo:regionN  -1.334e+00  5.503e-01  -2.425 0.015305 *  
## Xstatusquo:regionS  -9.096e-01  5.185e-01  -1.754 0.079367 .  
## Xstatusquo:regionSA -5.848e-01  5.179e-01  -1.129 0.258835    
## XsexM:educationPS    7.822e-01  6.095e-01   1.283 0.199363    
## XsexM:educationS     1.109e+00  4.576e-01   2.423 0.015372 *  
## XeducationPS:income -1.786e-05  1.092e-05  -1.636 0.101853    
## XeducationS:income  -3.038e-05  1.106e-05  -2.747 0.006013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2360.29  on 1702  degrees of freedom
## Residual deviance:  679.84  on 1685  degrees of freedom
## AIC: 715.84
## 
## Number of Fisher Scoring iterations: 7
  • Marginal Effect (average predictive difference)

These are the average marginal effects of the predictors: the average change in the response per unit change in each predictor. (For a glm, `margins()` reports effects on the probability scale by default — its `type` argument defaults to `"response"` — not on the logit scale.)

library(margins)
margins(fit.1)
##  statusquo    income     sexM educationPS educationS regionM   regionN  regionS
##     0.1868 1.299e-07 -0.03462    -0.09252   -0.06642 0.03994 -0.005573 -0.02219
##  regionSA
##  -0.01038

  (ii) Construct likelihood-ratio-based 95-percent confidence intervals for the regression coefficients, comparing these with confidence intervals based on the Wald statistic.

The two sets of confidence intervals differ because they are based on different statistics: `confint()` profiles the likelihood, while the Wald intervals rely on a quadratic (normal) approximation to the log-likelihood at the MLE. They agree asymptotically but can diverge in finite samples, especially for coefficients far from zero.

library(MASS)
# profile-likelihood confidence intervals
confint(fit.1)
##                            2.5 %        97.5 %
## (Intercept)         6.407410e-01  2.233121e+00
## sexM               -1.862295e+00 -5.692311e-01
## educationPS        -2.424150e+00 -1.590100e-01
## educationS         -1.318495e+00  2.493140e-01
## statusquo           3.288418e+00  5.064434e+00
## regionM            -1.092702e+00  1.193435e+00
## regionN            -1.169396e+00  3.664779e-01
## regionS            -1.293480e+00  7.280108e-02
## regionSA           -1.040155e+00  3.227211e-01
## income             -1.148481e-06  3.791441e-05
## statusquo:regionM  -3.216615e+00 -3.444426e-01
## statusquo:regionN  -2.477703e+00 -2.833711e-01
## statusquo:regionS  -2.008015e+00  5.253823e-02
## statusquo:regionSA -1.682196e+00  3.757971e-01
## sexM:educationPS   -4.062275e-01  1.986585e+00
## sexM:educationS     2.190783e-01  2.015864e+00
## educationPS:income -3.926441e-05  3.435857e-06
## educationS:income  -5.211281e-05 -8.850258e-06
# Wald confidence interval
confint.default(fit.1)
##                            2.5 %        97.5 %
## (Intercept)         6.200348e-01  2.206250e+00
## sexM               -1.854730e+00 -5.634822e-01
## educationPS        -2.406776e+00 -1.401782e-01
## educationS         -1.315016e+00  2.507592e-01
## statusquo           3.192684e+00  4.950163e+00
## regionM            -1.118946e+00  1.102556e+00
## regionN            -1.150630e+00  3.751288e-01
## regionS            -1.265882e+00  9.244447e-02
## regionSA           -1.015979e+00  3.385263e-01
## income             -1.252417e-06  3.795837e-05
## statusquo:regionM  -3.244387e+00 -4.764319e-01
## statusquo:regionN  -2.413037e+00 -2.559440e-01
## statusquo:regionS  -1.925709e+00  1.065900e-01
## statusquo:regionSA -1.599858e+00  4.302812e-01
## sexM:educationPS   -4.123695e-01  1.976735e+00
## sexM:educationS     2.121093e-01  2.005876e+00
## educationPS:income -3.925385e-05  3.537192e-06
## educationS:income  -5.206282e-05 -8.706076e-06
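
One way to quantify the difference is to compare interval widths (a sketch, assuming `fit.1` is available):

```r
ci.lr   <- confint(fit.1)           # profile likelihood
ci.wald <- confint.default(fit.1)   # Wald
# ratio of interval widths, LR vs. Wald, per coefficient:
summary((ci.lr[, 2] - ci.lr[, 1]) / (ci.wald[, 2] - ci.wald[, 1]))
```
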

  4. Fit a probit model to the data, comparing the results to those obtained with the logit model. Which do you think is better? Why?

Answer:

The logit model is better. The two models have nearly identical residual deviances and AICs, which means their abilities to fit the data are similar. But log-odds (and odds ratios) have an intuitive interpretation that is much easier to communicate than the inverse of the normal distribution function.

fit.1.p<-glm(vote~sex+education+statusquo+region*statusquo+sex*education+education*income, family = binomial(link = "probit"), data=df1)
summary(fit.1.p)
## 
## Call:
## glm(formula = vote ~ sex + education + statusquo + region * statusquo + 
##     sex * education + education * income, family = binomial(link = "probit"), 
##     data = df1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4202  -0.2446  -0.0638   0.1861   3.1590  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         7.189e-01  2.102e-01   3.420 0.000626 ***
## sexM               -6.658e-01  1.736e-01  -3.836 0.000125 ***
## educationPS        -6.805e-01  2.975e-01  -2.288 0.022154 *  
## educationS         -2.623e-01  2.097e-01  -1.251 0.211100    
## statusquo           2.185e+00  2.070e-01  10.556  < 2e-16 ***
## regionM             3.042e-02  2.975e-01   0.102 0.918543    
## regionN            -1.673e-01  2.023e-01  -0.827 0.408237    
## regionS            -3.221e-01  1.788e-01  -1.802 0.071601 .  
## regionSA           -1.722e-01  1.795e-01  -0.959 0.337374    
## income              9.965e-06  5.455e-06   1.827 0.067708 .  
## statusquo:regionM  -9.276e-01  3.454e-01  -2.685 0.007251 ** 
## statusquo:regionN  -6.518e-01  2.582e-01  -2.524 0.011598 *  
## statusquo:regionS  -4.893e-01  2.386e-01  -2.050 0.040335 *  
## statusquo:regionSA -2.699e-01  2.412e-01  -1.119 0.263078    
## sexM:educationPS    4.380e-01  3.141e-01   1.395 0.163097    
## sexM:educationS     6.057e-01  2.398e-01   2.526 0.011543 *  
## educationPS:income -9.603e-06  5.883e-06  -1.632 0.102627    
## educationS:income  -1.620e-05  5.980e-06  -2.709 0.006754 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2360.3  on 1702  degrees of freedom
## Residual deviance:  678.6  on 1685  degrees of freedom
## AIC: 714.6
## 
## Number of Fisher Scoring iterations: 7
summary(fit.1)
## 
## Call:
## glm(formula = vote ~ sex + education + statusquo + region * statusquo + 
##     sex * education + education * income, family = binomial, 
##     data = df1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2366  -0.2526  -0.1007   0.2043   3.0222  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.413e+00  4.047e-01   3.492 0.000479 ***
## sexM               -1.209e+00  3.294e-01  -3.671 0.000242 ***
## educationPS        -1.273e+00  5.782e-01  -2.202 0.027638 *  
## educationS         -5.321e-01  3.994e-01  -1.332 0.182799    
## statusquo           4.071e+00  4.483e-01   9.081  < 2e-16 ***
## regionM            -8.195e-03  5.667e-01  -0.014 0.988463    
## regionN            -3.878e-01  3.892e-01  -0.996 0.319155    
## regionS            -5.867e-01  3.465e-01  -1.693 0.090421 .  
## regionSA           -3.387e-01  3.455e-01  -0.980 0.326952    
## income              1.835e-05  1.000e-05   1.835 0.066541 .  
## statusquo:regionM  -1.860e+00  7.061e-01  -2.635 0.008422 ** 
## statusquo:regionN  -1.334e+00  5.503e-01  -2.425 0.015305 *  
## statusquo:regionS  -9.096e-01  5.185e-01  -1.754 0.079367 .  
## statusquo:regionSA -5.848e-01  5.179e-01  -1.129 0.258835    
## sexM:educationPS    7.822e-01  6.095e-01   1.283 0.199363    
## sexM:educationS     1.109e+00  4.576e-01   2.423 0.015372 *  
## educationPS:income -1.786e-05  1.092e-05  -1.636 0.101853    
## educationS:income  -3.038e-05  1.106e-05  -2.747 0.006013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2360.29  on 1702  degrees of freedom
## Residual deviance:  679.84  on 1685  degrees of freedom
## AIC: 715.84
## 
## Number of Fisher Scoring iterations: 7
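
The comparison can be summarized numerically (a sketch, assuming `fit.1` and `fit.1.p` from the chunks above are in the workspace):

```r
c(logit = AIC(fit.1), probit = AIC(fit.1.p))
# logit coefficients are typically about 1.6-1.8 times the probit coefficients,
# reflecting the different latent-scale variances of the two links:
coef(fit.1)["statusquo"] / coef(fit.1.p)["statusquo"]
```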

2. Exercise D14.2 (Polytomous outcome)

Proceed as in Exercise D14.1, but now include all of the data and the four possible outcome values.

Use, as appropriate, one or more of the following: a multinomial logit model; a proportional odds logit model; logit models fit to a set of nested dichotomies; or similar probit models. If you fit the proportional-odds model, test the assumption of parallel regressions. If you fit more than one kind of model, which model do you prefer? Why?

Answer: I prefer the multinomial logit model.

First, a proportional-odds logit model needs to satisfy the assumption of parallel regressions, i.e., that the effects (slopes) of all predictors are the same across the voting-choice categories, but it is hard to justify this without prior theoretical support.

Second, nested dichotomies are suitable for data in which some specific dichotomies have strong explanatory meaning. Although {{abstain}, {undecided}}; {{yes}, {no}} might seem a reasonable set of dichotomies, I prefer the more general model to avoid making overly subjective assumptions ahead of time (before seeing the data).

# multinomial 
library(nnet)
df2<-df0 %>%
  na.omit()
fit.mul<-multinom(vote~sex+education+statusquo+region*statusquo+sex*education+education*income, data = df2, trace=FALSE)
summary(fit.mul)
## Call:
## multinom(formula = vote ~ sex + education + statusquo + region * 
##     statusquo + sex * education + education * income, data = df2, 
##     trace = FALSE)
## 
## Coefficients:
##   (Intercept)        sexM  educationPS educationS  statusquo  regionM
## N  -0.4302381  1.04072449  0.003603185 -0.3819216 -2.4953983 1.775085
## U   1.8007582  0.18233926 -1.218563973 -0.7016967  0.4908302 1.030827
## Y   0.9774933 -0.05678138 -1.291937225 -0.9230553  2.0363597 1.866984
##      regionN   regionS  regionSA        income statusquo:regionM
## N  0.5072018 1.0382959 0.5721628 -7.916620e-06         0.7350750
## U -0.6156558 0.1283097 0.4615417 -1.007149e-05        -0.3751536
## Y  0.3408189 0.5615416 0.4754139 -3.423454e-06        -1.1303277
##   statusquo:regionN statusquo:regionS statusquo:regionSA sexM:educationPS
## N         0.8628394         0.6642593          0.9086600       -0.6824764
## U        -0.5227324        -0.3629505          0.1400176       -0.9529600
## Y        -0.5880764        -0.3576132          0.3235728       -0.1077670
##   sexM:educationS educationPS:income educationS:income
## N      -0.4349772       1.665489e-05      1.249157e-05
## U      -0.6069996       1.285366e-05      3.918208e-06
## Y       0.1537621       1.350720e-05      1.045914e-06
## 
## Std. Errors:
##    (Intercept)         sexM  educationPS   educationS    statusquo      regionM
## N 4.278119e-10 1.886824e-10 8.755354e-11 9.214273e-11 3.373841e-10 9.562084e-12
## U 3.344499e-10 1.329942e-10 7.358857e-11 1.096929e-10 1.918880e-10 1.330464e-11
## Y 2.146883e-10 8.594397e-11 6.740000e-11 8.501784e-11 1.400180e-10 1.286789e-11
##        regionN      regionS     regionSA       income statusquo:regionM
## N 6.634457e-11 1.113878e-10 1.545221e-10 6.332474e-06      4.825088e-12
## U 4.497765e-11 9.433166e-11 9.541348e-11 5.769082e-06      1.586740e-11
## Y 6.088133e-11 7.650483e-11 3.108129e-11 5.743517e-06      1.175194e-11
##   statusquo:regionN statusquo:regionS statusquo:regionSA sexM:educationPS
## N      4.797515e-11      5.760114e-11       1.525226e-10     5.517518e-11
## U      2.223125e-11      5.137584e-11       7.418159e-11     2.883382e-11
## Y      5.099554e-11      4.991413e-11       1.069570e-11     4.126737e-11
##   sexM:educationS educationPS:income educationS:income
## N    4.802383e-11       7.052818e-06      6.805373e-06
## U    4.230412e-11       6.731373e-06      6.350527e-06
## Y    4.257199e-11       6.603392e-06      6.309124e-06
## 
## Residual Deviance: 3977.642 
## AIC: 4085.642
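
If one were to entertain the proportional-odds model instead, the parallel-regressions assumption could be tested directly. A sketch, assuming `df2` from the chunk above; note that the ordering of the categories is itself a strong assumption, since abstain does not fit naturally on an ordinal scale:

```r
library(MASS)  # polr
library(car)   # poTest
# hypothetical ordering: N < A < U < Y
df2$vote.ord <- ordered(df2$vote, levels = c("N", "A", "U", "Y"))
fit.po <- polr(vote.ord ~ sex + education + statusquo, data = df2)
poTest(fit.po)  # Wald tests of the parallel-regressions assumption
```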

3. Exercise D15.3 (GLM Diagnostics)

Return to the logit (and probit) model that you fit in Exercise D14.1.

  1. Use the diagnostic methods for generalized linear models described in this chapter to check the adequacy of the final model that you fit to the data.

Answer:

From the Cook’s distance plot, we find at least one influential point, observation 2662. But from the half-normal leverage plot it is not a high-leverage point, so it must instead have a large residual, i.e., it is an outlier.

library(faraway)
#leverage point
halfnorm(hatvalues(fit.1))

# Cook's distance
plot(fit.1, which = 4, id.n = 3)
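
To examine the flagged case numerically (a sketch, assuming `fit.1` is available):

```r
cd <- cooks.distance(fit.1)
head(sort(cd, decreasing = TRUE), 3)  # the most influential observations
rstudent(fit.1)[which.max(cd)]        # a large studentized residual -> outlier
hatvalues(fit.1)[which.max(cd)]       # a modest hat value -> not high leverage
```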

  • Non-linearity

From the binned plot of the deviance residuals against the predicted linear combination of the predictors (\(\hat\eta\)), there is a pattern that is not random; thus, the assumed linear relationship between X and the canonical parameter of Y’s distribution is not very accurate.

Considering the deviance-residual plots against the continuous predictors statusquo and income: the statusquo plot also shows a specific pattern, in which larger statusquo values tend to have positive deviance residuals while smaller ones tend to have negative ones. As for the income plot, given the limited number of distinct values, it is hard to judge the pattern.

In conclusion, we might introduce a non-linear relationship into the model.

library(arm)
# predicted values
x<-predict(fit.1,type="link")
y<-residuals(fit.1)
binnedplot(x,y,xlab=expression(hat(eta)),ylab="Deviance residuals")

# statusquo
x<-df1$statusquo
y<-residuals(fit.1)
binnedplot(x,y,xlab="statusquo",ylab="Deviance residuals")

# income
x<-df1$income
y<-residuals(fit.1)
binnedplot(x,y,xlab="income", ylab="Deviance residuals")

However, based on the goodness-of-fit tests, since we cannot reject the null hypothesis, the predicted values approximate the observed values well; i.e., the model is adequate in terms of prediction.

library("regclass")
check_regression(fit.1, extra = TRUE)
## Method 1 (comparing each observation with simulated results given model is correct; not very sensitive)
##   p-value of goodness of fit test is approximately 0.39
## 
## Method 2 (Hosmer-Lemeshow test with 10 categories; overly sensitive for large sample sizes) 
##   p-value of goodness of fit test is approximately 0.496

  2. If the model contains a discrete quantitative explanatory variable, test for nonlinearity by specifying a model that treats this variable as a factor (e.g., using dummy regressors), and comparing that model via a likelihood-ratio test to the model that specifies that the variable has a linear effect. (If there is more than one discrete quantitative explanatory variable, then begin with a model that treats all of them as factors, contrasting this with a sequence of models that specifies a linear effect for each such variable in turn.) Note that this is analogous to the approach for testing for nonlinearity in a linear model with discrete explanatory variables described in Section 12.4.1.

Answer:

There is one discrete quantitative predictor, income, in the model. The analysis-of-deviance output gives a p-value > 0.05, so we fail to reject the null hypothesis; i.e., there is no evidence of lack of fit for the linearity assumption on the income effect. We can say that the relationship between income and the logit of Y is adequately linear.

fit.1.f<-glm(vote~sex+education+statusquo+region*statusquo+sex*education+education*as.factor(income), family = binomial, data=df1)
anova(fit.1, fit.1.f, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: vote ~ sex + education + statusquo + region * statusquo + sex * 
##     education + education * income
## Model 2: vote ~ sex + education + statusquo + region * statusquo + sex * 
##     education + education * as.factor(income)
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      1685     679.84                     
## 2      1670     666.39 15   13.445   0.5679
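
A more targeted check is to add a low-order polynomial term and test it with a likelihood-ratio test (a sketch, assuming `fit.1` is available):

```r
# hypothetical quadratic-income check
fit.1.q <- update(fit.1, . ~ . + I(income^2))
anova(fit.1, fit.1.q, test = "LRT")
```
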

  3. Explore the use of the log-log and complementary-log-log links as alternatives to the logit link for this regression. Comparing deviances under the different links, which link appears to best represent the data?

Answer:

Comparing residual deviances, the original logit model has the smallest (679.84, versus 729.29 for the complementary log-log link and 722.06 for the cauchit link), so the logit link appears to best represent the data. (Note that base R’s binomial() family does not offer a log-log link directly; a common workaround is to fit a complementary log-log model to the reversed response, so the cauchit link is used here as the second alternative.)

fit.cloglog<-glm(vote~sex+education+statusquo+region*statusquo+sex*education+education*income, family = binomial("cloglog"), data=df1)
fit.cauchit<-glm(vote~sex+education+statusquo+region*statusquo+sex*education+education*income, family = binomial("cauchit"), data=df1)
anova(fit.1, fit.cloglog, fit.cauchit)
## Analysis of Deviance Table
## 
## Model 1: vote ~ sex + education + statusquo + region * statusquo + sex * 
##     education + education * income
## Model 2: vote ~ sex + education + statusquo + region * statusquo + sex * 
##     education + education * income
## Model 3: vote ~ sex + education + statusquo + region * statusquo + sex * 
##     education + education * income
##   Resid. Df Resid. Dev Df Deviance
## 1      1685     679.84            
## 2      1685     729.29  0  -49.446
## 3      1685     722.06  0    7.224
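Since the three models have the same number of parameters, their deviances (equivalently, AICs) can be compared directly without a formal test. A minimal self-contained sketch of this loop over links, using simulated data rather than the Chile data:

```r
# Compare binomial links by residual deviance on simulated data (illustration only)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.2 * x))  # response generated under a logit model

links <- c("logit", "probit", "cloglog", "cauchit")
devs <- sapply(links, function(lk) {
  deviance(glm(y ~ x, family = binomial(lk)))
})
devs  # with a logit-generated response, the logit link typically fits best
```

Because the linear predictor is identical across fits, the link with the smallest deviance is the one that best matches the shape of the response curve.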

4. Exercise D15.1 (Count data)

Long (1990, 1997) investigates factors affecting the research productivity of doctoral students in biochemistry. Long’s data (on 915 biochemists) are in the file Long.txt. The response variable in this investigation, art, is the number of articles published by the student during the last three years of his or her PhD programme.

The explanatory variables are as follows:

Explanatory variables in the `Long.txt` data

Variable name  Definition
fem            Gender: dummy variable - 1 if female, 0 if male
mar            Marital status: dummy variable - 1 if married, 0 if not
kid5           Number of children five years old or younger
phd            Prestige rating of PhD department
ment           Number of articles published by mentor during last three years
  1. Examine the distribution of the response variable. Based on this distribution, does it appear promising to model these data by linear least-squares regression, perhaps after transforming the response? Explain your answer.

Answer:

From the histogram, the distribution of art is highly right-skewed with a large spike at zero, which is typical of count data and suggests a Poisson-like distribution. Linear least-squares regression is not promising here: the response is a non-negative count, its variance clearly grows with its mean, and a log transformation cannot be applied directly because many students published zero articles (log(0) is undefined). A count-data model such as Poisson regression, which uses a log link rather than a log-transformed response, is more appropriate.

df4<-read.table("./data/Long.txt")
hist(df4$art, freq = FALSE)
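Beyond the histogram, a quick numeric check is to compare the observed proportion of zeros with the proportion a Poisson distribution with the same mean would imply; count responses like art often show more zeros than that. A self-contained sketch with a toy count vector standing in for df4$art:

```r
# Toy count vector standing in for df4$art (illustration only)
counts <- c(0, 0, 0, 0, 1, 1, 2, 3, 5, 8)

obs_zero  <- mean(counts == 0)        # observed proportion of zeros
pois_zero <- dpois(0, mean(counts))   # P(X = 0) under Poisson with same mean

c(observed = obs_zero, poisson = pois_zero)
# here 0.40 observed vs exp(-2) = 0.135 implied: excess zeros
```

A large gap between the two proportions is one symptom that a plain least-squares (or even plain Poisson) model may struggle with the data.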

  1. Following Long, perform a Poisson regression of art on the explanatory variables. What do you conclude from the results of this regression?

Answer:

First, gender, mentor’s publication count, marital status and number of young children are statistically significant; the prestige rating of the PhD department is not.

Second, to interpret the coefficients (holding the other covariates fixed in each case):

fem: the expected number of articles published by a female student in the last three years is e^(-0.224594) = 0.7988 (79.88%) times that of a comparable male student, i.e., about 20% fewer.

ment: each additional article published by the student’s mentor in the past three years multiplies the student’s expected article count by e^(0.025543) = 1.026, i.e., about 2.6% more.

mar: married students are expected to publish e^(0.155243) = 1.1679 (116.79%) times as many articles as unmarried students, i.e., about 16.79% more. Guess: being married is helpful to publishing articles?

kid5: each additional child five years old or younger multiplies the expected article count by e^(-0.184883) = 0.8312, i.e., about 17% fewer. Guess: caring for young children may reduce students’ productivity. (This needs further exploration, however.)

fit.poi<-glm(art~., data = df4, family = "poisson")
summary(fit.poi)
## 
## Call:
## glm(formula = art ~ ., family = "poisson", data = df4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5672  -1.5398  -0.3660   0.5722   5.4467  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.304619   0.102980   2.958   0.0031 ** 
## fem         -0.224594   0.054613  -4.112 3.92e-05 ***
## ment         0.025543   0.002006  12.733  < 2e-16 ***
## phd          0.012822   0.026397   0.486   0.6271    
## mar          0.155243   0.061374   2.529   0.0114 *  
## kid5        -0.184883   0.040127  -4.607 4.08e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1817.4  on 914  degrees of freedom
## Residual deviance: 1634.4  on 909  degrees of freedom
## AIC: 3314.1
## 
## Number of Fisher Scoring iterations: 5
exp(summary(fit.poi)$coefficients[,1])
## (Intercept)         fem        ment         phd         mar        kid5 
##   1.3561083   0.7988403   1.0258718   1.0129045   1.1679420   0.8312018
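The exponentiated coefficients above are point estimates of multiplicative effects (rate ratios); exponentiating Wald confidence limits gives interval estimates on the same scale. A self-contained sketch on simulated Poisson data (the same idiom would apply to fit.poi):

```r
# Simulated Poisson regression; rate ratios with 95% Wald confidence intervals
set.seed(2)
n <- 1000
x <- rbinom(n, 1, 0.5)
y <- rpois(n, exp(0.3 - 0.2 * x))   # true rate ratio for x is exp(-0.2)

fit <- glm(y ~ x, family = poisson)
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))

# exponentiate the estimate and its Wald limits -> multiplicative scale
exp(cbind(RR = est, lo = est - 1.96 * se, hi = est + 1.96 * se))
```

Reporting the interval alongside the rate ratio makes claims like “about 20% fewer articles” easier to assess.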

After adding the interaction terms fem:mar and fem:kid5 to the model, the effect-display plots show that the effect of marriage is positive for both genders, with males having a higher baseline than females; the effect of young kids is negative for both genders, with females affected somewhat more (steeper slope). However, by the likelihood-ratio test, adding the interaction terms does not improve the model appreciably.

library(effects)
df4$fem<-factor(df4$fem)
df4$mar<-factor(df4$mar)
fit.poi.2<-glm(art~fem+ment+mar+kid5+fem*mar+fem*kid5, data = df4, family = "poisson")
plot(predictorEffects(fit.poi.2, ~mar), lines=list(multiline=TRUE)) # fem*mar interaction

plot(predictorEffects(fit.poi.2, ~kid5), lines=list(multiline=TRUE)) # fem*kid5 interaction

anova(fit.poi, fit.poi.2, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: art ~ fem + ment + phd + mar + kid5
## Model 2: art ~ fem + ment + mar + kid5 + fem * mar + fem * kid5
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       909     1634.4                     
## 2       908     1634.2  1  0.19486   0.6589
  1. Perform regression diagnostics on the model fit in the previous question. If you identify any problems, try to deal with them. Are the conclusions of the research altered?

Answer:

  • Influential data

It seems that observation 186 is not only a high-leverage point but also an influential point. Checking the data, the other covariates look unremarkable, but the mentor published 77 articles in the last 3 years! Since 77 is also an outlier in the ment distribution, far above the average, I suspect a recording error (perhaps 7 was intended), but I decide to drop the observation rather than change its value, since I cannot know the true number.

The other influential point, observation 467, is not a high-leverage point, so it must be poorly fitted. On inspection, it stands out simply because of exceptional productivity: both the student’s publication count (19) and the mentor’s (42) set him apart from the other students. Given such a productive mentor, the student’s output is plausible, but the observation is still incompatible with this “average” model, so I decide to drop it as well.

# leverage points (halfnorm() is from the faraway package)
library(faraway)
halfnorm(hatvalues(fit.poi))

# Cook's distance
plot(fit.poi, which = 4, id.n = 3)

df4[c(186,467),]
##     fem ment  phd mar kid5 art
## 186   0   77 1.78   1    1   1
## 467   0   42 1.86   1    0  19
summary(df4$ment)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   6.000   8.767  12.000  77.000
summary(df4$art)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.693   2.000  19.000
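Instead of reading influential cases off the plot, they can also be flagged programmatically, e.g., with the common 4/n rule of thumb for Cook’s distance. A self-contained sketch on simulated data (with fit.poi in place, the same two lines would flag cases such as 186 and 467):

```r
# Flag high-influence observations via Cook's distance (4/n cutoff is a convention)
set.seed(3)
n <- 200
x <- rnorm(n)
y <- rpois(n, exp(0.2 + 0.3 * x))
y[5] <- 60                         # plant one gross outlier

fit <- glm(y ~ x, family = poisson)
cd  <- cooks.distance(fit)
which(cd > 4 / n)                  # indices of flagged observations
```

The cutoff is only a screening device; flagged cases still need the kind of case-by-case inspection done above.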
  • Nonlinearity

The binned residual plots show systematic patterns, which suggests that the linear combination of predictors is not adequate: the model shows some lack of fit, and nonlinear terms for the predictors may need to be added.

library(arm)
# predicted values
x<-predict(fit.poi,type="link")
y<-residuals(fit.poi)
binnedplot(x,y,xlab=expression(hat(eta)),ylab="Deviance residuals")

# mentor
x<-df4$ment
y<-residuals(fit.poi)
binnedplot(x,y,xlab="mentor's publications",ylab="Deviance residuals")

# kid5
x<-df4$kid5
y<-residuals(fit.poi)
binnedplot(x,y,xlab="young kids' number",ylab="Deviance residuals")

After adjusting the model based on the diagnostics, i.e., dropping the 2 observations and treating kid5 as a categorical variable, the coefficient estimates and their signs are essentially unchanged, so, generally speaking, the previous conclusions hold for the adjusted model. The most notable difference is that the new model shows separate effects for 1, 2 and 3 young kids, and the negative effect clearly grows with the number of kids.

df4.c<-df4[-c(186,467),]
fit.poi.c<-glm(art~fem+ment+mar+as.factor(kid5), data = df4.c, family = "poisson")
s.c<-summary(fit.poi.c)$coefficients
s<-summary(fit.poi)$coefficients
summary(fit.poi.c)
## 
## Call:
## glm(formula = art ~ fem + ment + mar + as.factor(kid5), family = "poisson", 
##     data = df4.c)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6386  -1.5502  -0.3765   0.5807   5.4782  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.333805   0.061007   5.472 4.46e-08 ***
## fem1             -0.211952   0.054856  -3.864 0.000112 ***
## ment              0.026102   0.002135  12.227  < 2e-16 ***
## mar1              0.123644   0.063346   1.952 0.050952 .  
## as.factor(kid5)1 -0.126129   0.070915  -1.779 0.075305 .  
## as.factor(kid5)2 -0.299961   0.091538  -3.277 0.001050 ** 
## as.factor(kid5)3 -0.786767   0.281840  -2.792 0.005246 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1759.6  on 912  degrees of freedom
## Residual deviance: 1595.7  on 906  degrees of freedom
## AIC: 3270.6
## 
## Number of Fisher Scoring iterations: 5
s[,1] 
## (Intercept)         fem        ment         phd         mar        kid5 
##  0.30461909 -0.22459418  0.02554275  0.01282193  0.15524324 -0.18488268
s.c[,1]
##      (Intercept)             fem1             ment             mar1 
##       0.33380534      -0.21195178       0.02610168       0.12364353 
## as.factor(kid5)1 as.factor(kid5)2 as.factor(kid5)3 
##      -0.12612928      -0.29996130      -0.78676665
  1. Refit Long’s model allowing for overdispersion (using a quasi-Poisson or negative-binomial model). Does this make a difference to the results?

Answer:

The quasi-Poisson fit gives exactly the same coefficient estimates and deviance as the Poisson fit. This is expected: quasi-Poisson uses the same mean model and estimating equations, and only rescales the standard errors by the square root of the estimated dispersion parameter, making tests more conservative without changing the point estimates. The negative-binomial model, which changes the likelihood itself, decreases the deviance substantially.

fit.qua<-glm(art~., data = df4, family = quasipoisson(link = "log"))
library(MASS) # for glm.nb
fit.nb<-glm.nb(art~., data = df4)
anova(fit.poi, fit.qua, fit.nb)
## Analysis of Deviance Table
## 
## Model 1: art ~ fem + ment + phd + mar + kid5
## Model 2: art ~ fem + ment + phd + mar + kid5
## Model 3: art ~ fem + ment + phd + mar + kid5
##   Resid. Df Resid. Dev Df Deviance
## 1       909     1634.4            
## 2       909     1634.4  0     0.00
## 3       909     1004.3  0   630.09
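The equivalence of Poisson and quasi-Poisson point estimates can be demonstrated directly: both solve the same estimating equations, and quasi-Poisson only multiplies the covariance matrix by the estimated dispersion. A self-contained sketch on simulated overdispersed counts:

```r
# Quasi-Poisson vs Poisson: identical coefficients, inflated standard errors
set.seed(4)
n <- 500
x <- rnorm(n)
mu <- exp(0.5 + 0.4 * x)
y <- rnbinom(n, mu = mu, size = 1)   # overdispersed counts (negative binomial)

fp <- glm(y ~ x, family = poisson)
fq <- glm(y ~ x, family = quasipoisson)

all.equal(coef(fp), coef(fq))                      # TRUE: identical estimates
phi <- summary(fq)$dispersion                      # estimated dispersion (> 1 here)
se_ratio <- sqrt(diag(vcov(fq)) / diag(vcov(fp)))  # equals sqrt(phi) for each term
round(c(dispersion = phi, se_ratio), 3)
```

So “no difference” in the anova table is exactly what quasi-Poisson should show; its contribution is wider standard errors, not a different fit.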